Trainable, Scalable Summarization Using Robust NLP and Machine Learning
Authors
Abstract
We describe a trainable and scalable summarization system which utilizes features derived from information retrieval, information extraction, and NLP techniques and on-line resources. The system combines these features using a trainable feature combiner learned from summary examples through a machine learning algorithm. We demonstrate system scalability by reporting results on the best combination of summarization features for different document sources. We also present preliminary results from a task-based evaluation on summarization output usability.

1 Introduction

Frequency-based (Edmundson, 1969; Kupiec, Pedersen, and Chen, 1995; Brandow, Mitze, and Rau, 1995), knowledge-based (Reimer and Hahn, 1988; McKeown and Radev, 1995), and discourse-based (Johnson et al., 1993; Miike et al., 1994; Jones, 1995) approaches to automated summarization correspond to a continuum of increasing understanding of the text and increasing complexity in text processing. Given the goal of machine-generated summaries, these approaches attempt to answer three central questions:

• How does the system count words to calculate worthiness for summarization?
• How does the system incorporate the knowledge of the domain represented in the text?
• How does the system create a coherent and cohesive summary?

Our work leverages off of research in these three approaches and attempts to remedy some of the difficulties encountered in each by applying a combination of information retrieval, information extraction, and NLP techniques and on-line resources with machine learning to generate summaries. Our DimSum system follows a common paradigm of sentence extraction, but automates acquiring candidate knowledge and learns what knowledge is necessary to summarize. We present how we automatically acquire candidate features in Section 2.

*We would like to thank Jamie Callan for his help with the INQUERY experiments.
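The excerpt does not spell out how the trainable feature combiner works. One classical instantiation of trainable sentence extraction is the naive-Bayes classifier of Kupiec, Pedersen, and Chen (1995), cited above: estimate, from summary examples, how likely each feature value is in summary-worthy versus other sentences, then rank sentences by posterior log-odds. The sketch below is our own illustration of that scheme, not DimSum's actual combiner; the function names, boolean feature encoding, and smoothing constant are all assumptions.

```python
import math
from collections import Counter

def train(feature_vectors, labels, alpha=1.0):
    """Learn a naive-Bayes combiner from labeled sentences.

    feature_vectors: one tuple of boolean feature values per sentence.
    labels: 1 if the sentence appeared in a human summary, else 0.
    Returns a log-likelihood function loglik(fv, c) with add-alpha smoothing.
    """
    n = len(labels)
    n_pos = sum(labels)
    prior = {1: (n_pos + alpha) / (n + 2 * alpha),
             0: (n - n_pos + alpha) / (n + 2 * alpha)}
    k = len(feature_vectors[0])
    # Per-class, per-feature counts of observed values.
    counts = {c: [Counter() for _ in range(k)] for c in (0, 1)}
    for fv, y in zip(feature_vectors, labels):
        for i, v in enumerate(fv):
            counts[y][i][v] += 1

    def loglik(fv, c):
        total = sum(1 for y in labels if y == c)
        s = math.log(prior[c])
        for i, v in enumerate(fv):
            s += math.log((counts[c][i][v] + alpha) / (total + 2 * alpha))
        return s

    return loglik

def score(loglik, fv):
    """Posterior log-odds that a sentence belongs in the summary."""
    return loglik(fv, 1) - loglik(fv, 0)
```

In use, each sentence is encoded as feature values (e.g., contains a signature word, appears early in the document), the combiner is trained on documents paired with example summaries, and the top-scoring sentences are extracted.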
Section 3 describes our training methodology for combining features to generate summaries, and discusses evaluation results of both batch and machine learning methods. Section 4 reports our task-based evaluation.

2 Extracting Features

In this section, we describe how the system counts linguistically-motivated, automatically-derived words and multi-words in calculating worthiness for summarization. We show how the system uses an external corpus to incorporate domain knowledge in contrast to text-only statistics. Finally, we explain how we attempt to increase the cohesiveness of our summaries by using name aliasing, WordNet synonyms, and morphological variants.

2.1 Defining Single and Multiword Terms

Frequency-based summarization systems typically use a single word string as the unit for counting frequency. Though robust, such a method ignores the semantic content of words and their potential membership in multi-word phrases and may introduce noise in frequency counting by treating the same strings uniformly regardless of context. Our approach, similar to (Tzoukermann, Klavans, and Jacquemin, 1997), is to apply NLP tools to extract multi-word phrases automatically with high accuracy and use them as the basic unit in the summarization process, including frequency calculation. Our system uses both text statistics (term frequency, or tf) and corpus statistics (inverse document frequency, or idf) (Salton and McGill, 1983) to derive signature words as one of the summarization features. If single words were the sole basis of counting for our summarization application, noise would be
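The tf·idf weighting cited above (Salton and McGill, 1983) scores a term by its frequency in the document, discounted by how many background-corpus documents contain it, so that corpus-wide common words are suppressed. A minimal sketch follows; the function name, smoothing in the idf denominator, and toy inputs are our illustration, not the paper's implementation:

```python
import math
from collections import Counter

def signature_words(doc_tokens, corpus_doc_freq, n_docs, top_k=5):
    """Rank a document's terms by tf * idf and return the top_k.

    doc_tokens: tokenized document (term frequency is computed from it).
    corpus_doc_freq: term -> number of background documents containing it.
    n_docs: size of the background corpus.
    """
    tf = Counter(doc_tokens)

    def tfidf(term):
        df = corpus_doc_freq.get(term, 0)
        # Add 1 to the document frequency to avoid division by zero.
        return tf[term] * math.log(n_docs / (1 + df))

    return sorted(tf, key=tfidf, reverse=True)[:top_k]
```

A term that is frequent in the document but rare in the corpus (here, "summary") outranks a term that is common everywhere (here, "system"), which is the behavior the signature-word feature relies on.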
Publication date: 1998